Fault Tolerance

 

Review: write about fault tolerance problem being addressed, high level technique: replication (where and how), re-execution, etc.

 

Comments from review:

-       questioning assumptions about trends (e.g. self configuring systems, hardware getting more reliable)

o      QUESTION: What is really happening?

-       how accurate is the data? It came from one system

o      QUESTION: what do you expect?

 

  1. Intro
    1. Reliability: how long do you execute before a failure

                                              i.     MTTF

    1. Availability: the probability that a request for service is satisfied

                                              i.     MTTF / (MTTF + MTTR)

                                             ii.     How do you achieve high availability?

1.   Make MTTF big (highly reliable) or MTTR small (fast to repair)

                                           iii.     99%                   ~3.65 days of downtime per year

                                           iv.     99.9%        ~9 hours

                                            v.     99.99%      ~1 hour

                                           vi.     99.999%    ~5 minutes

                                         vii.     99.9999% ~30 seconds
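
A minimal sketch of the arithmetic behind the availability figures above, assuming a 365-day year. The function names are illustrative and not taken from the notes.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def availability(mttf_hours, mttr_hours):
    """Fraction of time the service is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_per_year_seconds(avail):
    """Expected seconds of downtime per year at a given availability."""
    return (1.0 - avail) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for nines in ("0.99", "0.999", "0.9999", "0.99999", "0.999999"):
        secs = downtime_per_year_seconds(float(nines))
        print(f"{nines}: {secs / 3600:.2f} hours ({secs:.0f} seconds) per year")
```

The same availability can be reached by raising MTTF or by lowering MTTR: a component with MTTF of 99 hours and MTTR of 1 hour has the same availability as one with MTTF of 990 hours and MTTR of 10 hours.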

    1. What is the cost of an hour of downtime (in 2002)?

                                              i.     Brokerage: $6,000,000

                                             ii.     eBay: $225,000

                                           iii.     Cell phone activation: $41,000

                                           iv.     Home shopping channel: $113,000

    1. What is MTTF for a disk?

                                              i.     900,000 hours – roughly 100 years

    1. What is MTTF for an OS?

                                              i.     Windows 2000: 72 weeks

    1. Failures

                                              i.     Terminology:

1.   Fault = bug in code

2.   Error = erroneous state as a result of executing code

a.    Latent errors: executed fault but did not cause failure yet

3.   Failure = system does not act according to its specification

                                             ii.     Types

1.   Bohr bugs / deterministic bugs:

a.    Bugs that recur every time you do something – easily repeatable / predictable / can be tracked down and fixed / often found in testing

2.   Heisenbugs / nondeterministic bugs

a.    Bugs that don't recur every time / caused by an unlikely combination of events / hard to reproduce and repair

                                           iii.     Causes of failure

1.   Hardware (cpu, devices) – 18%

2.   Environment (network, power) – 14%

3.   Software (OS, applications) – 25%

4.   Operations (maintenance, administration) – 42%

                                           iv.     When do failures occur?

1.   Infant mortality – new, under tested

2.   Normal lifetime – highly reliable

3.   Wear-out period (for HW) – things break physically, or (for SW) assumptions about the world have changed too much

                                            v.     Failure models – Why important?

1.   Timing failures occur when a component violates timing constraints.

2.   Output or response failures occur when a component outputs an incorrect value.

3.   Omission failures occur when a component fails to produce an expected output.

4.   Crash failures occur when the component stops producing any outputs.

5.   Byzantine or arbitrary failures occur when any other behavior, including malicious behavior, occurs

 

                                           vi.     Synthetic failure models

1.   Halt on failure

2.   Failure status

3.   Stable Storage

    1. Approaches:

                                              i.     Fault Avoidance: make sure failures don't happen

1.   Fault prevention: write code without bugs

a.    better languages

b.    better software engineering

c.    tool usage during coding process

d.    e.g. write a new OS in a new language, prove properties of implementation

2.   Fault removal: remove bugs from code

a.    e.g. run testing tool (valgrind, purify)

b.    windows static driver verifier – find bugs statically

3.   Fault workaround: make sure faults don't execute

a.    Firewall / virus detector

b.    "It hurts when I run" -> "don't run"

                                             ii.     Fault Tolerance

1.   Allow failures to occur, but keep system running

2.   Basic ideas:

a.    Fault detection – figure out that something bad happened

b.    Isolation – keep bad state from spreading to whole system

c.    Recovery – get the bad part back into a good state

3.   Basic approaches to error detection

a.    Check dynamically for error conditions and inconsistencies to detect failures early

b.    Use heart beats to make sure a module is still executing

c.    QUESTION: how easy is it to do this generically?

                                                                                                    i.     QUESTION: as code evolves?

                                                                                                   ii.     QUESTION: at what cost?
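
A minimal sketch of heartbeat-based detection as described above. The class name, the timeout value, and the injectable clock are assumptions made for illustration; a real monitor would also have to decide what to do about a suspected module, and a missed heartbeat only means the module is suspected, not proven dead.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per module and reports overdue ones."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so tests can control time
        self.last_seen = {}

    def beat(self, module):
        """Called by (or on behalf of) a module to show it is still running."""
        self.last_seen[module] = self.clock()

    def suspected(self):
        """Modules whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            m for m, t in self.last_seen.items() if now - t > self.timeout_s
        )
```

The cost questions raised above show up directly: a short timeout detects failures faster but raises false suspicions of slow modules, and every module must be modified (or wrapped) to call `beat`.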

4.   Basic approaches to isolation

a.    Decompose into modules

                                                                                                    i.     Unit of failure is small

b.    Check each module for errors

                                                                                                    i.     Fails fast – doesn't spread corruption

                                                                                                   ii.     Isolate from other modules

c.    Hardware / software boundaries around modules

                                                                                                    i.     Whole machine

                                                                                                   ii.     address space

                                                                                                 iii.     extra instructions

5.   Basic approaches to recovery

a.    Restore system to a functioning state

                                                                                                    i.     E.g. configure extra modules to take over for failed module, restart failed module

b.    Forwards / Backwards

c.    Concealing / revealing

d.    Basic approaches:

                                                                                                    i.     Logging / retry

                                                                                                   ii.     Checkpoint / restore

                                                                                                 iii.     Replicate (process pairs)

                                                                                                 iv.     Alternate versions

                                                                                                  v.     Transactions (undo)

                                                                                                 vi.     Reveal faults up the stack

e.    Concepts:

                                                                                                    i.     Have multiple Ys, Multiple Xs that are identical. Switch between Xs when Ys fail

1.   Fault Tolerance

                                                                                                   ii.     Isolate X from Y so survival of X does not depend on Y

1.   Fault Containment

2.   Some useful things fail, but not all - partitioning

f.     Redundancy: do things twice or more

                                                                                                    i.     On two machines

                                                                                                   ii.     In two processes

                                                                                                 iii.     In two places (state in memory / on disk checkpoint)

                                                                                                 iv.     At two times (e.g. checkpoint / restore)

                                                                                                  v.     QUESTION: what kinds of bugs are handled?

g.    Diversity: do things multiple different ways

                                                                                                    i.     Different platforms

                                                                                                   ii.     Different implementations

                                                                                                 iii.     Idea: unlikely to have common failure modes

                                                                                                 iv.     Name: n-version programming, recovery blocks
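
A minimal sketch of the recovery-block idea named above: try diverse implementations in order and accept the first result that passes an acceptance test. The function names and the integer-square-root example are hypothetical. The alternates here are pure functions, so no state needs to be rolled back between attempts; a full recovery block would also restore state before trying the next alternate.

```python
import math


def recovery_block(alternates, acceptance_test, *args):
    """Return the first alternate's result that passes the acceptance test."""
    for alternate in alternates:
        try:
            result = alternate(*args)
        except Exception:
            continue                      # a crash counts as a failed alternate
        if acceptance_test(result, *args):
            return result
    raise RuntimeError("every alternate failed or was rejected")


def fast_but_buggy_isqrt(n):
    """Deliberately wrong primary implementation (illustration only)."""
    return n // 2


def accept_isqrt(result, n):
    """Acceptance test: result is the floor of the square root of n."""
    return result * result <= n < (result + 1) ** 2


if __name__ == "__main__":
    print(recovery_block([fast_but_buggy_isqrt, math.isqrt], accept_isqrt, 16))
```

The approach only helps when the alternates do not share the fault and the acceptance test is strong enough to reject wrong answers.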

6.   Basic questions for fault tolerance: where do you do the fault tolerance?

a.    In the hardware (e.g. two processors, RAID with multiple disks)

b.    Between the HW and the OS (e.g. virtual machine)

c.    Within the OS

d.    Between the OS and the application

e.    Within the application

7.   General principle:

a.    If everything above layer X is identical, can tolerate faults at X or below automatically

                                                                                                    i.     E.g. FT unix -> HW, OS faults

                                                                                                   ii.     E.g. Hypervisor -> HW faults

                                                                                                 iii.     E.g. Nooks -> driver faults (everything else is above)

                                                                                                 iv.     E.g. Disco -> OS faults

b.    If have some diversity above X, can tolerate heisenbugs above layer X

                                                                                                    i.     Process pairs – execute different streams

                                                                                                   ii.     Checkpoint / restart: if restart far enough back
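
A minimal sketch of checkpoint / restart with retry, the second technique above for tolerating Heisenbugs. The `Checkpointed` class and `run_with_retry` helper are hypothetical names; the checkpoint lives in memory here, whereas a real system would put it on stable storage or on another machine.

```python
import copy


class Checkpointed:
    """Holds live state plus the last known-good copy of it."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def restore(self):
        # Discard possibly corrupted state and go back to the checkpoint.
        self.state = copy.deepcopy(self._checkpoint)


def run_with_retry(svc, op, attempts=3):
    """Run op on the state; on failure roll back and re-execute."""
    for _ in range(attempts):
        try:
            op(svc.state)
            svc.checkpoint()      # op finished: this is the new good state
            return True
        except Exception:
            svc.restore()         # undo partial updates before retrying
    return False
```

Re-execution only helps if the failure does not recur, which is why this handles nondeterministic bugs but not Bohr bugs: a deterministic fault fails the same way on every attempt.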

  1. Goals for fault tolerance
    1. High performance

                                              i.     Not much additional cost over an unreliable system

    1. Low cost

                                              i.     Not much additional hardware or software

    1. Transparent to existing code

                                              i.     Can make existing programs / OS more reliable

    1. Tolerates lots of failures

                                              i.     Hardware

                                             ii.     Software

                                           iii.     Human

  1. (Gray) Approaches to Redundancy
    1. Process pairs:

                                              i.     Run two copies

                                             ii.     Switch from one to the other on failure of one

    1. How to use:

                                              i.     Lockstep processes – HW failures only

1.   Both CPUs do the same work, no extra capacity

                                             ii.     Explicit State checkpoints – do computation, send state changes to backup

1.   Backup can do computations from latest state

                                           iii.     Automatic checkpoints – log messages

1.   Inefficient – don't know what to checkpoint, must send everything

                                           iv.     Delta checkpoints – send operations, not state

1.   re-execute on other side. Reduces bandwidth

                                            v.     Persistent processes – only replicate persistent data and session existence, not transient per-session data (internal in-memory data structures)

1.   Make state changes persistent: e.g. all on disk

2.   On failure, backup wakes up knowing sessions but not state

3.   On failure, internal state is in unknown, inconsistent situation
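
A minimal sketch of a process pair using delta checkpoints: the primary forwards each operation to the backup, which re-executes it, so the backup can take over with the same state. The `Replica` and `ProcessPair` classes and the account-balance example are hypothetical; both replicas live in one Python process here, whereas a real pair runs on separate processors and exchanges messages.

```python
class Replica:
    """One copy of the service state (account balances, for illustration)."""

    def __init__(self, balances=None):
        self.balances = dict(balances or {})

    def apply(self, op):
        account, delta = op
        self.balances[account] = self.balances.get(account, 0) + delta


class ProcessPair:
    def __init__(self):
        self.primary = Replica()
        self.backup = Replica()

    def execute(self, op):
        self.primary.apply(op)
        self.backup.apply(op)     # delta checkpoint: send the operation, not the state

    def fail_over(self):
        """Promote the backup and start a fresh backup from its state."""
        self.primary = self.backup
        self.backup = Replica(self.primary.balances)
```

Sending operations instead of state reduces bandwidth, but it requires that re-executing an operation on the backup gives the same result as on the primary.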

    1. Transactions

                                              i.     Group of operations that form a consistent transformation of state - ACID

1.   Atomic – all or nothing

2.   Consistent – every transactional execution sees a correct picture of the state, even if other transactions are executing

3.   Integrity – is a correct state transformation

4.   Durable – a committed transaction's effects persist even if a failure occurs after the transaction commits

                                             ii.     Operations

1.   Begin transaction

2.   Commit – make effects durable

3.   Abort – undo partial effects

                                           iii.     Use for fault tolerance

1.   Allows use of persistent process pairs

a.    Allows undo of actions in a transaction that aborted

b.    Allows reset of system to known good state

                                           iv.     QUESTION: What is great about transactions?

1.   Can reason about state of system with failures

                                            v.     QUESTION: Why not use transactions?

1.   Programming cost

2.   Performance cost – extra communication

                                           vi.     QUESTION: What is MTTR here?

1.   Must detect failure

2.   Backup must abort in-progress transactions

a.    no state to sync or log to replay
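
A minimal sketch of the begin / commit / abort operations above, using an in-memory undo log to make abort restore the prior state. The `Transaction` class is hypothetical and covers only atomicity: it does not model durability (nothing is written to disk) or isolation between concurrent transactions.

```python
_MISSING = object()   # marks "key did not exist before this transaction"


class Transaction:
    """Begin = construct; then write(), and finally commit() or abort()."""

    def __init__(self, store):
        self.store = store
        self.undo_log = []

    def write(self, key, value):
        # Record the old value first so the write can be undone.
        self.undo_log.append((key, self.store.get(key, _MISSING)))
        self.store[key] = value

    def commit(self):
        self.undo_log.clear()          # effects stay; nothing left to undo

    def abort(self):
        # Undo in reverse order so the earliest old value wins.
        while self.undo_log:
            key, old = self.undo_log.pop()
            if old is _MISSING:
                self.store.pop(key, None)
            else:
                self.store[key] = old
```

This is what lets a backup recover simply: on takeover it aborts whatever was in progress and the state is back at the last commit, with nothing to sync or replay.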

    1. FT communication

                                              i.     Session abstraction

1.   Sequenced

2.   Retry on alternate path if path fails

3.   Notify endpoints if all paths fail

4.   Sessions handle switching to backup automatically if primary fails

5.   On TX abort, sequence number reverts to beginning of transaction, intervening messages cancelled
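
A minimal sketch of the session abstraction above: messages carry sequence numbers, a send is retried on alternate paths, and the caller is notified when every path fails. The `Session` class, the `AllPathsFailed` exception, and the convention that a path is a callable raising `ConnectionError` are assumptions for illustration; failover to a backup endpoint and sequence-number rollback on transaction abort are not modeled.

```python
class AllPathsFailed(Exception):
    """Raised to notify the endpoint that no path could deliver the message."""


class Session:
    def __init__(self, paths):
        # Each path is a callable path(seq, msg) that raises ConnectionError
        # when it cannot deliver.
        self.paths = paths
        self.next_seq = 0

    def send(self, msg):
        seq = self.next_seq
        for path in self.paths:
            try:
                path(seq, msg)
            except ConnectionError:
                continue              # retry on the next path
            self.next_seq += 1        # advance only after a successful send
            return seq
        raise AllPathsFailed(f"message {seq} could not be delivered")
```

Because the sequence number advances only on success, a receiver can detect duplicates or gaps if a retried message arrives more than once.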

    1. FT storage

                                              i.     Store on multiple disks

                                             ii.     Many replication options – take 739 for details

                                           iii.     Transactions + logs for ensuring storage updated consistently